This notebook presents a comprehensive study of electricity demand forecasting for the United Kingdom (UK) using historical data. The goal of this project is to develop a robust and accurate model that can predict future electricity demand patterns, enabling better planning and resource allocation.
# Import general packages
import os
import pandas as pd
import numpy as np
from numpy import ndarray
from datetime import datetime, timedelta
import seaborn as sns
# Local package created to provide clean code
from electricity_forecast.plot_data import interactive_chart, static_chart, distribution_plot, plot_variables, \
plot_seasonal_day_week, plot_seasonal_month_year, plot_features_demand
from electricity_forecast.data_preprocessing import transform_cyclical_features, tranform_label_features
The dataset provided by the UK National Grid operator includes observations of electricity demand (in megawatts) measured in each half-hour of a day from January 2009 until April 2023. Additionally, weather data was obtained from Visual Crossing weather data services, from January 2009 until December 2023, which includes average daily temperature, humidity, windspeed, and other weather variables. The dataset consists of two separate files: one for electricity demand data and the other for weather data.
The following code reads the data from files and verifies that it satisfies the problem description and requirements.
from electricity_forecast.data_loader import DataLoader
# Load and preprocess the data
print(os.getcwd() + '/data/')
data_loader = DataLoader(os.getcwd() + '/data/')
df_energy, df_weather = data_loader.load_data('original')
display(df_energy.info())
display(df_weather.info())
display(df_energy.head(5))
display(df_weather.head(5))
d:\projects\Forecasting.ElectricityDemand.UK/data/
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250942 entries, 0 to 250941
Data columns (total 21 columns):
 #   Column                     Non-Null Count   Dtype
---  ------                     --------------   -----
 0   Unnamed: 0                 250942 non-null  int64
 1   settlement_date            250942 non-null  object
 2   settlement_period          250942 non-null  int64
 3   nd                         250942 non-null  int64
 4   tsd                        250942 non-null  int64
 5   england_wales_demand       250942 non-null  int64
 6   embedded_wind_generation   250942 non-null  int64
 7   embedded_wind_capacity     250942 non-null  int64
 8   embedded_solar_generation  250942 non-null  int64
 9   embedded_solar_capacity    250942 non-null  int64
 10  non_bm_stor                250942 non-null  int64
 11  pump_storage_pumping       250942 non-null  int64
 12  ifa_flow                   250942 non-null  int64
 13  ifa2_flow                  250942 non-null  int64
 14  britned_flow               250942 non-null  int64
 15  moyle_flow                 250942 non-null  int64
 16  east_west_flow             250942 non-null  int64
 17  nemo_flow                  250942 non-null  int64
 18  nsl_flow                   75646 non-null   float64
 19  eleclink_flow              75646 non-null   float64
 20  is_holiday                 250942 non-null  int64
dtypes: float64(2), int64(18), object(1)
memory usage: 40.2+ MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5478 entries, 0 to 5477
Data columns (total 33 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   name              5478 non-null   object
 1   datetime          5478 non-null   object
 2   tempmax           5478 non-null   float64
 3   tempmin           5478 non-null   float64
 4   temp              5478 non-null   float64
 5   feelslikemax      5478 non-null   float64
 6   feelslikemin      5478 non-null   float64
 7   feelslike         5478 non-null   float64
 8   dew               5478 non-null   float64
 9   humidity          5478 non-null   float64
 10  precip            5478 non-null   float64
 11  precipprob        5478 non-null   float64
 12  precipcover       5478 non-null   float64
 13  preciptype        2702 non-null   object
 14  snow              5336 non-null   float64
 15  snowdepth         5385 non-null   float64
 16  windgust          2548 non-null   float64
 17  windspeed         5478 non-null   float64
 18  winddir           5478 non-null   float64
 19  sealevelpressure  5194 non-null   float64
 20  cloudcover        5478 non-null   float64
 21  visibility        5478 non-null   float64
 22  solarradiation    4971 non-null   float64
 23  solarenergy       4971 non-null   float64
 24  uvindex           4971 non-null   float64
 25  severerisk        579 non-null    float64
 26  sunrise           5478 non-null   object
 27  sunset            5478 non-null   object
 28  moonphase         5478 non-null   float64
 29  conditions        5478 non-null   object
 30  description       5336 non-null   object
 31  icon              5478 non-null   object
 32  stations          5322 non-null   object
dtypes: float64(24), object(9)
memory usage: 1.4+ MB
None
| Unnamed: 0 | settlement_date | settlement_period | nd | tsd | england_wales_demand | embedded_wind_generation | embedded_wind_capacity | embedded_solar_generation | embedded_solar_capacity | ... | pump_storage_pumping | ifa_flow | ifa2_flow | britned_flow | moyle_flow | east_west_flow | nemo_flow | nsl_flow | eleclink_flow | is_holiday | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2009-01-01 | 1 | 37910 | 38704 | 33939 | 54 | 1403 | 0 | 0 | ... | 33 | 2002 | 0 | 0 | -161 | 0 | 0 | NaN | NaN | 1 |
| 1 | 1 | 2009-01-01 | 2 | 38047 | 38964 | 34072 | 53 | 1403 | 0 | 0 | ... | 157 | 2002 | 0 | 0 | -160 | 0 | 0 | NaN | NaN | 1 |
| 2 | 2 | 2009-01-01 | 3 | 37380 | 38651 | 33615 | 53 | 1403 | 0 | 0 | ... | 511 | 2002 | 0 | 0 | -160 | 0 | 0 | NaN | NaN | 1 |
| 3 | 3 | 2009-01-01 | 4 | 36426 | 37775 | 32526 | 50 | 1403 | 0 | 0 | ... | 589 | 1772 | 0 | 0 | -160 | 0 | 0 | NaN | NaN | 1 |
| 4 | 4 | 2009-01-01 | 5 | 35687 | 37298 | 31877 | 50 | 1403 | 0 | 0 | ... | 851 | 1753 | 0 | 0 | -160 | 0 | 0 | NaN | NaN | 1 |
5 rows × 21 columns
| name | datetime | tempmax | tempmin | temp | feelslikemax | feelslikemin | feelslike | dew | humidity | ... | solarenergy | uvindex | severerisk | sunrise | sunset | moonphase | conditions | description | icon | stations | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | United Kingdom | 2009-01-01 | 36.9 | 32.0 | 34.2 | 33.0 | 26.7 | 30.2 | 29.9 | 84.0 | ... | NaN | NaN | NaN | 2009-01-01T08:06:15 | 2009-01-01T16:02:16 | 0.17 | Overcast | Cloudy skies throughout the day. | cloudy | 03769099999,03660099999,03672099999,0378109999... |
| 1 | United Kingdom | 2009-01-02 | 40.3 | 30.9 | 35.7 | 34.3 | 25.5 | 30.8 | 31.0 | 83.4 | ... | NaN | NaN | NaN | 2009-01-02T08:06:07 | 2009-01-02T16:03:21 | 0.20 | Snow, Rain, Partially cloudy | Partly cloudy throughout the day with morning ... | rain | 03769099999,03660099999,03672099999,0378109999... |
| 2 | United Kingdom | 2009-01-03 | 37.5 | 23.2 | 28.9 | 36.0 | 19.3 | 26.6 | 22.7 | 78.3 | ... | NaN | NaN | NaN | 2009-01-03T08:05:55 | 2009-01-03T16:04:29 | 0.24 | Clear | Clear conditions throughout the day. | clear-day | 03769099999,03660099999,03672099999,0378109999... |
| 3 | United Kingdom | 2009-01-04 | 32.2 | 21.6 | 27.3 | 27.4 | 16.0 | 23.6 | 23.8 | 87.0 | ... | NaN | NaN | NaN | 2009-01-04T08:05:40 | 2009-01-04T16:05:39 | 0.25 | Partially cloudy | Partly cloudy throughout the day. | partly-cloudy-day | 03769099999,03660099999,03672099999,0378109999... |
| 4 | United Kingdom | 2009-01-05 | 35.3 | 28.4 | 31.9 | 27.1 | 19.8 | 24.4 | 26.9 | 82.3 | ... | NaN | NaN | NaN | 2009-01-05T08:05:22 | 2009-01-05T16:06:52 | 0.31 | Snow, Rain, Partially cloudy | Partly cloudy throughout the day with afternoo... | rain | 03769099999,03660099999,03672099999,0378109999... |
5 rows × 33 columns
Data format: The electricity demand data requires removing the first column and adding a time component to the date for further visual analysis. The weather data is heterogeneous and requires adding a time component to match the electricity time format. Both datasets were formatted to share the same datetime column name. Also, a subset of weather variables was selected to simplify the analysis.
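The formatting itself is done by format_electricity_data in the local package, so its internals are not shown here. As a hedged sketch, and assuming settlement period 1 corresponds to 00:00 of the settlement date, the half-hourly settlement periods can be turned into timestamps like this (toy data):

```python
import pandas as pd

# Hypothetical illustration: map half-hour settlement periods (1-48) to a
# timestamp, assuming period 1 starts at 00:00 of the settlement date.
df = pd.DataFrame({
    'settlement_date': ['2009-01-01', '2009-01-01'],
    'settlement_period': [1, 2],
})
df['datetime'] = (pd.to_datetime(df['settlement_date'])
                  + pd.to_timedelta((df['settlement_period'] - 1) * 30, unit='m'))
print(df['datetime'].tolist())
```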
data_loader.format_electricity_data()
data_loader.format_weather_data()
# Load and preprocess the data
print(os.getcwd() + '/data/')
data_loader = DataLoader(os.getcwd() + '/data/')
df_energy, df_weather, df_datetime = data_loader.load_data('formatted')
display(df_energy.info())
display(df_weather.info())
display(df_energy.head(5))
display(df_weather.head(5))
d:\projects\Forecasting.ElectricityDemand.UK/data/
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250944 entries, 0 to 250943
Data columns (total 18 columns):
 #   Column                     Non-Null Count   Dtype
---  ------                     --------------   -----
 0   datetime                   250944 non-null  datetime64[ns]
 1   nd                         250914 non-null  float64
 2   tsd                        250914 non-null  float64
 3   england_wales_demand       250914 non-null  float64
 4   embedded_wind_generation   250914 non-null  float64
 5   embedded_wind_capacity     250914 non-null  float64
 6   embedded_solar_generation  250914 non-null  float64
 7   embedded_solar_capacity    250914 non-null  float64
 8   non_bm_stor                250914 non-null  float64
 9   pump_storage_pumping       250914 non-null  float64
 10  ifa_flow                   250914 non-null  float64
 11  ifa2_flow                  250914 non-null  float64
 12  britned_flow               250914 non-null  float64
 13  moyle_flow                 250914 non-null  float64
 14  east_west_flow             250914 non-null  float64
 15  nemo_flow                  250914 non-null  float64
 16  nsl_flow                   75638 non-null   float64
 17  eleclink_flow              75638 non-null   float64
dtypes: datetime64[ns](1), float64(17)
memory usage: 34.5 MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 262897 entries, 0 to 262896
Data columns (total 9 columns):
 #  Column        Non-Null Count   Dtype
--  ------        --------------   -----
 0  datetime      262897 non-null  datetime64[ns]
 1  tempmax       262897 non-null  float64
 2  tempmin       262897 non-null  float64
 3  temp          262897 non-null  float64
 4  feelslikemax  262897 non-null  float64
 5  feelslikemin  262897 non-null  float64
 6  feelslike     262897 non-null  float64
 7  humidity      262897 non-null  float64
 8  windspeed     262897 non-null  float64
dtypes: datetime64[ns](1), float64(8)
memory usage: 18.1 MB
None
| datetime | nd | tsd | england_wales_demand | embedded_wind_generation | embedded_wind_capacity | embedded_solar_generation | embedded_solar_capacity | non_bm_stor | pump_storage_pumping | ifa_flow | ifa2_flow | britned_flow | moyle_flow | east_west_flow | nemo_flow | nsl_flow | eleclink_flow | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2009-01-01 00:00:00 | 37910.0 | 38704.0 | 33939.0 | 54.0 | 1403.0 | 0.0 | 0.0 | 0.0 | 33.0 | 2002.0 | 0.0 | 0.0 | -161.0 | 0.0 | 0.0 | NaN | NaN |
| 1 | 2009-01-01 00:30:00 | 38047.0 | 38964.0 | 34072.0 | 53.0 | 1403.0 | 0.0 | 0.0 | 0.0 | 157.0 | 2002.0 | 0.0 | 0.0 | -160.0 | 0.0 | 0.0 | NaN | NaN |
| 2 | 2009-01-01 01:00:00 | 37380.0 | 38651.0 | 33615.0 | 53.0 | 1403.0 | 0.0 | 0.0 | 0.0 | 511.0 | 2002.0 | 0.0 | 0.0 | -160.0 | 0.0 | 0.0 | NaN | NaN |
| 3 | 2009-01-01 01:30:00 | 36426.0 | 37775.0 | 32526.0 | 50.0 | 1403.0 | 0.0 | 0.0 | 0.0 | 589.0 | 1772.0 | 0.0 | 0.0 | -160.0 | 0.0 | 0.0 | NaN | NaN |
| 4 | 2009-01-01 02:00:00 | 35687.0 | 37298.0 | 31877.0 | 50.0 | 1403.0 | 0.0 | 0.0 | 0.0 | 851.0 | 1753.0 | 0.0 | 0.0 | -160.0 | 0.0 | 0.0 | NaN | NaN |
| datetime | tempmax | tempmin | temp | feelslikemax | feelslikemin | feelslike | humidity | windspeed | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2009-01-01 00:00:00 | 36.9 | 32.0 | 34.2 | 33.0 | 26.7 | 30.2 | 84.0 | 5.9 |
| 1 | 2009-01-01 00:30:00 | 36.9 | 32.0 | 34.2 | 33.0 | 26.7 | 30.2 | 84.0 | 5.9 |
| 2 | 2009-01-01 01:00:00 | 36.9 | 32.0 | 34.2 | 33.0 | 26.7 | 30.2 | 84.0 | 5.9 |
| 3 | 2009-01-01 01:30:00 | 36.9 | 32.0 | 34.2 | 33.0 | 26.7 | 30.2 | 84.0 | 5.9 |
| 4 | 2009-01-01 02:00:00 | 36.9 | 32.0 | 34.2 | 33.0 | 26.7 | 30.2 | 84.0 | 5.9 |
Partial conclusion: All data fits the requirements stated in the problem. The datetime column in each file has information of the date and time, which could help with visualization of the data.
The next step is to explore the data to get some insights that could help understand how to forecast the electricity demand of the UK with this data.
Perform exploratory data analysis (EDA) to understand the distribution, relationships, and patterns in the data. Visualizations can help identify insights and guide feature engineering.
# Print a summary of the dataset
print("Dataset Summary:")
display(df_energy.describe())
display(df_weather.describe())
Dataset Summary:
| datetime | nd | tsd | england_wales_demand | embedded_wind_generation | embedded_wind_capacity | embedded_solar_generation | embedded_solar_capacity | non_bm_stor | pump_storage_pumping | ifa_flow | ifa2_flow | britned_flow | moyle_flow | east_west_flow | nemo_flow | nsl_flow | eleclink_flow | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 250944 | 250914.000000 | 250914.000000 | 250914.000000 | 250914.000000 | 250914.000000 | 250914.000000 | 250914.000000 | 250914.000000 | 250914.000000 | 250914.000000 | 250914.00000 | 250914.000000 | 250914.000000 | 250914.000000 | 250914.000000 | 75638.000000 | 75638.000000 |
| mean | 2016-02-27 23:45:00.000001024 | 31828.702882 | 33215.576309 | 28954.201954 | 1212.054525 | 4210.098966 | 796.052588 | 7755.611014 | 7.406059 | 320.662071 | 918.987211 | 11.42723 | 542.740078 | -107.847824 | -25.244988 | 150.901923 | 184.870753 | -54.207184 |
| min | 2009-01-01 00:00:00 | 13367.000000 | 0.000000 | 0.000000 | 0.000000 | 1403.000000 | 0.000000 | 0.000000 | -24.000000 | 0.000000 | -2056.000000 | -1030.00000 | -1215.000000 | -505.000000 | -585.000000 | -1022.000000 | -1455.000000 | -1028.000000 |
| 25% | 2012-07-30 23:52:30 | 25619.000000 | 27238.000000 | 23297.000000 | 520.000000 | 2085.000000 | 0.000000 | 1819.000000 | 0.000000 | 8.000000 | 208.250000 | 0.00000 | 0.000000 | -251.000000 | -127.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 2016-02-27 23:45:00 | 31266.000000 | 32463.000000 | 28422.000000 | 971.000000 | 4152.000000 | 0.000000 | 9300.000000 | 0.000000 | 12.000000 | 1246.000000 | 0.00000 | 766.000000 | -120.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 2019-09-26 23:37:30 | 37537.000000 | 38690.000000 | 34174.000000 | 1648.000000 | 6192.000000 | 730.000000 | 13080.000000 | 0.000000 | 466.000000 | 1899.000000 | 0.00000 | 994.000000 | 45.000000 | 5.000000 | 0.000000 | 366.000000 | 0.000000 |
| max | 2023-04-25 23:30:00 | 59095.000000 | 60147.000000 | 53325.000000 | 5354.000000 | 6574.000000 | 9830.000000 | 13861.000000 | 893.000000 | 2019.000000 | 2066.000000 | 1016.00000 | 1143.000000 | 499.000000 | 504.000000 | 1033.000000 | 1401.000000 | 1002.000000 |
| std | NaN | 7768.776368 | 7697.950323 | 7043.248055 | 925.835390 | 1925.165504 | 1594.304883 | 5484.671920 | 41.235912 | 544.239312 | 1092.471718 | 307.62554 | 507.435010 | 223.073430 | 250.977542 | 387.718689 | 489.713956 | 362.135756 |
| datetime | tempmax | tempmin | temp | feelslikemax | feelslikemin | feelslike | humidity | windspeed | |
|---|---|---|---|---|---|---|---|---|---|
| count | 262897 | 262897.000000 | 262897.000000 | 262897.000000 | 262897.000000 | 262897.000000 | 262897.000000 | 262897.000000 | 262897.000000 |
| mean | 2016-07-01 12:00:00.000000768 | 59.935904 | 47.646589 | 53.623939 | 57.563539 | 44.124298 | 52.007846 | 75.021550 | 13.102991 |
| min | 2009-01-01 00:00:00 | 29.900000 | 19.100000 | 26.000000 | 0.000000 | 0.000000 | 16.700000 | 36.000000 | 2.800000 |
| 25% | 2012-10-01 06:00:00 | 51.200000 | 41.100000 | 46.400000 | 50.600000 | 35.800000 | 42.600000 | 67.500000 | 9.900000 |
| 50% | 2016-07-01 12:00:00 | 59.600000 | 47.900000 | 53.500000 | 59.100000 | 44.500000 | 53.000000 | 75.900000 | 12.400000 |
| 75% | 2020-03-31 18:00:00 | 68.300000 | 54.800000 | 61.200000 | 68.300000 | 54.600000 | 61.300000 | 82.900000 | 15.700000 |
| max | 2023-12-31 00:00:00 | 103.700000 | 71.800000 | 86.500000 | 100.700000 | 71.800000 | 90.900000 | 98.900000 | 39.500000 |
| std | NaN | 11.472219 | 9.017741 | 9.805832 | 15.667978 | 13.092974 | 11.913379 | 10.440113 | 4.679502 |
# Plot electricity demand
# interactive_chart(df_energy, start_date=datetime(2009, 1, 1), end_date=datetime(2023, 4, 30)) # High computational load
static_chart(df_energy, start_date=datetime(2009, 1, 1), end_date=datetime(2023, 4, 30))
Seasonal info: The electricity demand chart shows seasonality, so it will be helpful to decompose the timestamp into features related to this variable. Thus, some datetime features are extracted, related to year, month, day of year, among others. Distribution plots, box plots, and violin plots are presented to help with the analysis.
# Create date and time features to analyze seasonality
data_loader.extract_time_features()
df_energy, df_weather, df_datetime = data_loader.load_data('formatted')
display(df_datetime.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250944 entries, 0 to 250943
Data columns (total 14 columns):
 #   Column          Non-Null Count   Dtype
---  ------          --------------   -----
 0   datetime        250944 non-null  datetime64[ns]
 1   date_period     250944 non-null  float64
 2   date_isholiday  250944 non-null  bool
 3   date_year       250944 non-null  int64
 4   date_month      250944 non-null  int64
 5   date_day        250944 non-null  int64
 6   date_hour       250944 non-null  int64
 7   date_minute     250944 non-null  int64
 8   date_dayofyear  250944 non-null  int64
 9   date_weekday    250944 non-null  int64
 10  date_quarter    250944 non-null  int64
 11  date_week       250944 non-null  int64
 12  date_isweekend  250944 non-null  bool
 13  date_daypart    250944 non-null  object
dtypes: bool(2), datetime64[ns](1), float64(1), int64(9), object(1)
memory usage: 23.5+ MB
None
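extract_time_features is provided by the local package, so its implementation is not shown. As a hedged sketch of the kind of extraction it likely performs, the pandas .dt accessor can derive these calendar features directly (toy data):

```python
import pandas as pd

# Hypothetical sketch of extracting calendar features with the pandas .dt accessor
idx = pd.date_range('2009-01-01', periods=4, freq='30min')
df = pd.DataFrame({'datetime': idx})
df['date_year'] = df['datetime'].dt.year
df['date_month'] = df['datetime'].dt.month
df['date_hour'] = df['datetime'].dt.hour
df['date_dayofyear'] = df['datetime'].dt.dayofyear
df['date_weekday'] = df['datetime'].dt.weekday
df['date_week'] = df['datetime'].dt.isocalendar().week.astype(int)
df['date_isweekend'] = df['date_weekday'] >= 5
print(df.head())
```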
# Plot electricity demand variables
plot_variables(df_energy, df_energy.columns[1:], start_date=datetime(2009, 1, 1), end_date=datetime(2023, 4, 30))
Based on the previous chart, only the tsd column is considered significant for this task, as it represents the target variable (electricity demand). The remaining variables cannot be used as features since they are measured at the same time as the target and would not be available at prediction time.
The next code will explore the influence of time features on the demand.
# Analyze seasonal components from day and week
df_energy_time = pd.merge(df_datetime, df_energy[['datetime', 'tsd']], on='datetime', how='left')
plot_seasonal_day_week(df_energy_time)
Hourly and Daily Seasonal analysis
There is a clear pattern in electricity demand that depends on the period of the day: demand is higher during the afternoon and evening, and lower at night and in the morning. Demand is similar across weekdays, but differs between weekdays and weekends. A likely reason is that commercial electricity consumption drops on weekends.
The weekend vs holiday chart indicates that on weekends, electricity demand is similar between holidays and non-holidays. On weekdays, however, there is a difference between holidays and non-holidays, with holiday values resembling weekend levels. This suggests that weekends and holidays may have the same effect on electricity demand, so it would be interesting to explore this approach in future analysis.
The hour and the half-hour period charts show a cyclical pattern. This means that the demand at the beginning is similar to the demand at the end. For example, the demand at 11pm is similar to the demand at midnight. The other features don't show the same pattern.
Also, minutes hold no relevance to demand. The half-hour periods can be seen as a more detailed view of the hourly demand.
Let's analyze now the seasonal month and year components.
# Analyze seasonal components from month and year
plot_seasonal_month_year(df_energy_time)
Monthly and Yearly Seasonal analysis
The day of the month chart indicates that this feature does not significantly influence electricity demand. As a result, it could be excluded from the forecast model.
On the other hand, the day of the year, week, month, and quarter exhibit a significant pattern of lower demand during summer and higher demand during winter. This behavior is attributed to the increased energy consumption for heating buildings during the colder months, particularly in the UK where winter temperatures can drop below 20 degrees Fahrenheit (-6.67 degrees Celsius). These features also demonstrate a cyclical pattern, meaning that the demand at the beginning of the year is similar to that at the end. For instance, the demand in January resembles the demand in December. An interesting observation is that these features, presented in the given order, show the same cyclical demand pattern but at progressively more detailed levels, from broader time scales (year) to more granular ones (quarter, month, week).
The year components show a decreasing demand trend, which is consistent with the reported reduction in UK electricity consumption.
Partial conclusions: All time features at and above the hour level have some degree of impact on the demand, aligning with the time-series approach adopted to address this problem. Hence, these features should be taken into account when constructing the forecasting model.
The next step is to handle missing values and outliers. Thus, the distribution of the demand is plotted.
# Plot electricity demand distribution
distribution_plot(df_energy)
# Detect missing data
datetime_nan = df_energy.loc[df_energy['tsd'].isna(), 'datetime']
print('Timestamps with nan values:')
display(datetime_nan)
datetime_zero = df_energy.loc[df_energy['tsd'] == 0, 'datetime']
print('Timestamps with zero values:')
display(datetime_zero)
Timestamps with nan values:
4222     2009-03-29 23:00:00
4223     2009-03-29 23:30:00
21694    2010-03-28 23:00:00
21695    2010-03-28 23:30:00
39166    2011-03-27 23:00:00
39167    2011-03-27 23:30:00
56638    2012-03-25 23:00:00
56639    2012-03-25 23:30:00
74446    2013-03-31 23:00:00
74447    2013-03-31 23:30:00
91918    2014-03-30 23:00:00
91919    2014-03-30 23:30:00
109390   2015-03-29 23:00:00
109391   2015-03-29 23:30:00
126862   2016-03-27 23:00:00
126863   2016-03-27 23:30:00
144334   2017-03-26 23:00:00
144335   2017-03-26 23:30:00
161806   2018-03-25 23:00:00
161807   2018-03-25 23:30:00
179614   2019-03-31 23:00:00
179615   2019-03-31 23:30:00
197086   2020-03-29 23:00:00
197087   2020-03-29 23:30:00
214558   2021-03-28 23:00:00
214559   2021-03-28 23:30:00
232030   2022-03-27 23:00:00
232031   2022-03-27 23:30:00
249502   2023-03-26 23:00:00
249503   2023-03-26 23:30:00
Name: datetime, dtype: datetime64[ns]
Timestamps with zero values:
11522 2009-08-29 01:00:00
11523 2009-08-29 01:30:00
11524 2009-08-29 02:00:00
11525 2009-08-29 02:30:00
11526 2009-08-29 03:00:00
...
60380 2012-06-11 22:00:00
60381 2012-06-11 22:30:00
60382 2012-06-11 23:00:00
60383 2012-06-11 23:30:00
66337 2012-10-14 00:30:00
Name: datetime, Length: 479, dtype: datetime64[ns]
Missing Data and Outliers
Some measurements are missing from the last two periods of the last Sunday of March every year, which coincides with the switch to British Summer Time; the pattern therefore appears related to how the data acquisition system handles the clock change. To address this issue, it is possible to replace the missing data based on the demand distribution for each period. Additionally, measurements equal to zero are considered missing or outliers and seem to span entire days. These zero measurements can also be replaced.
However, the initial approach is to remove observations with missing values for electricity demand. In further analysis, another approach could involve removing the days that have some periods with missing values, or applying imputation strategies to fill in the missing values.
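The actual handle_missing_data('remove_time') implementation lives in the local package; a minimal sketch of this removal approach on toy data, also treating zeros as missing, could look like:

```python
import numpy as np
import pandas as pd

# Hypothetical sketch: drop rows where demand is missing, treating zero demand
# as missing as well (zeros appear to span whole days in this dataset).
df = pd.DataFrame({'tsd': [37910.0, np.nan, 0.0, 36426.0]})
df['tsd'] = df['tsd'].replace(0, np.nan)          # zeros treated as missing
df = df.dropna(subset=['tsd']).reset_index(drop=True)
print(len(df))  # 2 rows remain
```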
# Handling missing data
data_loader.handle_missing_data('remove_time')
df_energy, df_weather, df_datetime = data_loader.load_data('no_missing')
Weather data
The next step is to analyze weather patterns and their relationship with the electricity demand.
# Plot weather variables over time
plot_variables(df_weather, df_weather.columns[1:], start_date=datetime(2009, 1, 1), end_date=datetime(2023, 4, 30))
It appears that weather data also has seasonal components. Let's plot the electricity demand against weather features to get some more insights.
# Plot electricity demand vs weather features
df_energy_weather = pd.merge(df_energy[['datetime', 'tsd']], df_weather, on='datetime', how='left')
plot_features_demand(df_energy_weather, df_weather.columns[1:])
As assumed at the beginning, weather data is related to electricity demand. It appears that temperature is the variable with the most influence on the demand. Therefore, all weather features should be included in the forecasting model.
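One way to quantify this visual impression is a simple Pearson correlation between demand and temperature. The sketch below uses made-up values purely for illustration:

```python
import pandas as pd

# Hypothetical values illustrating the demand/temperature relationship:
# colder days tend to show higher demand (heating load).
df = pd.DataFrame({
    'tsd':  [45000, 42000, 35000, 30000, 28000],
    'temp': [30.0, 35.0, 50.0, 60.0, 68.0],
})
corr = df['tsd'].corr(df['temp'])
print(round(corr, 3))  # strongly negative: colder -> higher demand
```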
In the process of building the forecast model, data transformation is necessary to create features that can be utilized by the models. Additionally, some new features may be generated. While each model has specific data requirements, there is a common stage of data transformation that is generally applied. This stage involves the following steps:
# Verify time file format
display(df_datetime.info())
<class 'pandas.core.frame.DataFrame'> RangeIndex: 250944 entries, 0 to 250943 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 datetime 250944 non-null datetime64[ns] 1 date_period 250944 non-null float64 2 date_isholiday 250944 non-null bool 3 date_year 250944 non-null int64 4 date_month 250944 non-null int64 5 date_day 250944 non-null int64 6 date_hour 250944 non-null int64 7 date_minute 250944 non-null int64 8 date_dayofyear 250944 non-null int64 9 date_weekday 250944 non-null int64 10 date_quarter 250944 non-null int64 11 date_week 250944 non-null int64 12 date_isweekend 250944 non-null bool 13 date_daypart 250944 non-null object dtypes: bool(2), datetime64[ns](1), float64(1), int64(9), object(1) memory usage: 23.5+ MB
None
# Create cyclical features
cyclical_features = ['date_period', 'date_month', 'date_hour', 'date_dayofyear', 'date_quarter', 'date_week']
df_cyclical = transform_cyclical_features(df_datetime, cyclical_features)
df_cyclical['datetime'] = df_datetime['datetime']
df_cyclical.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250944 entries, 0 to 250943
Data columns (total 13 columns):
 #   Column              Non-Null Count   Dtype
---  ------              --------------   -----
 0   date_period_sin     250944 non-null  float64
 1   date_period_cos     250944 non-null  float64
 2   date_month_sin      250944 non-null  float64
 3   date_month_cos      250944 non-null  float64
 4   date_hour_sin       250944 non-null  float64
 5   date_hour_cos       250944 non-null  float64
 6   date_dayofyear_sin  250944 non-null  float64
 7   date_dayofyear_cos  250944 non-null  float64
 8   date_quarter_sin    250944 non-null  float64
 9   date_quarter_cos    250944 non-null  float64
 10  date_week_sin       250944 non-null  float64
 11  date_week_cos       250944 non-null  float64
 12  datetime            250944 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(12)
memory usage: 24.9 MB
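transform_cyclical_features belongs to the local package; a minimal sketch of the sine/cosine transform it presumably applies, using the 24-hour cycle, shows why hour 23 ends up close to hour 0 in the transformed space:

```python
import numpy as np
import pandas as pd

# Hypothetical sketch of encoding a cyclical feature with sine and cosine.
hours = pd.Series([0, 6, 12, 23])
period = 24
hour_sin = np.sin(2 * np.pi * hours / period)
hour_cos = np.cos(2 * np.pi * hours / period)
# Hour 23 maps to a point on the unit circle close to hour 0's point (0, 1).
print(list(zip(hour_sin.round(3), hour_cos.round(3))))
```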
# Plot cyclical features
df_energy_cyclical = pd.merge(df_energy[['datetime', 'tsd']], df_cyclical, on='datetime', how='left')
plot_features_demand(df_energy_cyclical, df_cyclical.columns[:-1])
Cyclical features: From the cyclical features chart it can be assumed that not all cyclical features are representative when transformed with sine and cosine. It can be expected that date_dayofyear_cos and date_week_cos are the features that will contribute the most information to the forecast.
Considering these charts, all time features, original and transformed, will be included in the model.
The next step is to transform boolean and string features into integers.
# Convert is weekend and is holiday features
df_datetime['date_isweekend'] = df_datetime['date_isweekend'].astype(int)
df_datetime['date_isholiday'] = df_datetime['date_isholiday'].astype(int)
# Convert part of date to int
df_datetime['date_daypart'] = tranform_label_features(df_datetime, ['date_daypart'])
df_datetime.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250944 entries, 0 to 250943
Data columns (total 14 columns):
 #   Column          Non-Null Count   Dtype
---  ------          --------------   -----
 0   datetime        250944 non-null  datetime64[ns]
 1   date_period     250944 non-null  float64
 2   date_isholiday  250944 non-null  int32
 3   date_year       250944 non-null  int64
 4   date_month      250944 non-null  int64
 5   date_day        250944 non-null  int64
 6   date_hour       250944 non-null  int64
 7   date_minute     250944 non-null  int64
 8   date_dayofyear  250944 non-null  int64
 9   date_weekday    250944 non-null  int64
 10  date_quarter    250944 non-null  int64
 11  date_week       250944 non-null  int64
 12  date_isweekend  250944 non-null  int32
 13  date_daypart    250944 non-null  int32
dtypes: datetime64[ns](1), float64(1), int32(3), int64(9)
memory usage: 23.9 MB
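tranform_label_features is a local helper whose internals are not shown; a hedged sketch of an equivalent integer encoding using pandas.factorize (the category names here are illustrative):

```python
import pandas as pd

# Hypothetical sketch: map string categories to integer codes in order of
# first appearance, similar in spirit to the local label-transform helper.
daypart = pd.Series(['night', 'morning', 'afternoon', 'evening', 'night'])
codes, uniques = pd.factorize(daypart)
print(codes.tolist())   # [0, 1, 2, 3, 0]
print(list(uniques))
```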
Partial conclusions: The selected features for the forecast model include time features, cyclical features, and weather features. In the modeling step, a feature selection process will be executed to reduce the dimensionality of the feature space.
# Save features dataframe
df_features = pd.merge(df_datetime, df_cyclical, on='datetime', how='left')
df_features = pd.merge(df_features, df_weather, on='datetime', how='left')
data_loader.save_data('features_2009_2023.csv', df_features)
# Save demand dataframe
df_demand = df_energy[['datetime', 'tsd']]
data_loader.save_data('demand_2009_2023.csv', df_demand)
In this section, the chosen models are fitted to predict electricity demand in order to identify models with higher performance. To facilitate this task, we begin by importing necessary packages.
# FbProphet model
from prophet import Prophet
from prophet.serialize import model_to_json, model_from_json
# XGB regression model
from xgboost import XGBRegressor
# Neural network model
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import RMSprop
from tensorflow_probability.python.layers import IndependentNormal
from keras.metrics import MeanAbsoluteError, MeanAbsolutePercentageError
from keras.callbacks import EarlyStopping
# For time series splitting
from sklearn.model_selection import TimeSeriesSplit
from electricity_forecast.plot_data import plot_timeseries_split
# For model evaluation
from sklearn.metrics import mean_absolute_percentage_error, mean_absolute_error, mean_squared_error
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectFromModel
In model creation, experiments were designed and executed with different model structures to find the best fit for the problem. These models are:
# Define the model
prophet_model = Prophet()
prophet_model.add_country_holidays(country_name='GB')
# Add weather features to the model
for feature in df_weather.columns[1:]:
prophet_model.add_regressor(feature)
# Define the model
xgb_model = XGBRegressor(n_estimators=500, subsample=0.7, max_depth=5)
# Define loss function for the network
# Negative log likelihood
def negloglik(y_true, y_pred):
return -y_pred.log_prob(y_true)
# The model is defined before training
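In the probabilistic network, y_pred is a distribution object (the output of an IndependentNormal layer), which is why log_prob is available. For a normal distribution this loss reduces to the Gaussian negative log-likelihood, which a plain NumPy sketch makes explicit:

```python
import numpy as np

# Plain-NumPy sketch of the Gaussian negative log-likelihood that negloglik
# computes when y_pred is a normal distribution with mean mu and std sigma.
def gaussian_negloglik(y_true, mu, sigma):
    return 0.5 * np.log(2 * np.pi * sigma**2) + (y_true - mu)**2 / (2 * sigma**2)

# A prediction centred on the observation is penalised less than one far away.
close = gaussian_negloglik(30000.0, 30000.0, 1000.0)
far = gaussian_negloglik(30000.0, 35000.0, 1000.0)
print(close < far)  # True
```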
In this step, the train, validation, and test sets are created.
# Split test set
test_split = datetime(2023, 1, 1, 0, 0)
X_test = df_features.loc[df_features['datetime'] >= test_split]
y_test = df_demand.loc[df_demand['datetime'] >= test_split]
# Split train and validation set
validation_split = datetime(2022, 1, 1, 0, 0)
X_train = df_features.loc[df_features['datetime'] < validation_split]
y_train = df_demand.loc[df_demand['datetime'] < validation_split]
X_val = df_features.loc[(df_features['datetime'] >= validation_split) & (df_features['datetime'] < test_split)]
y_val = df_demand.loc[(df_demand['datetime'] >= validation_split) & (df_demand['datetime'] < test_split)]
# Cross validation split
n_years_test = 1
tss = TimeSeriesSplit(n_splits=5, test_size=48 * 365 * n_years_test, gap=48)
plot_timeseries_split(df_demand, tss, test_split)
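As a toy illustration of how TimeSeriesSplit produces expanding-window folds (parameters chosen small for readability, not the ones used above):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Each fold trains on all earlier samples and validates on the next block,
# with a gap between the two to avoid leakage across the boundary.
X = np.arange(20).reshape(-1, 1)
tss = TimeSeriesSplit(n_splits=3, test_size=4, gap=1)
for train_idx, val_idx in tss.split(X):
    print(train_idx.max(), val_idx.min(), val_idx.max())
```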
# Set saving directory for models
model_save_dir = 'model/'
if not os.path.exists(model_save_dir):
os.mkdir(model_save_dir)
model_name = 'fbprophet'
model_file = model_save_dir + 'model_' + model_name + '.json'
if not os.path.exists(model_file):
print('Training model')
train_prophet = pd.merge(y_train, X_train[df_weather.columns], on='datetime', how='left')
train_prophet.rename(columns={'datetime': 'ds', 'tsd': 'y'}, inplace=True)
prophet_model.fit(train_prophet)
print('Saving model')
with open(model_file, 'w') as fout:
fout.write(model_to_json(prophet_model)) # Save model
else:
print('Loading model')
with open(model_file, 'r') as fin:
prophet_model = model_from_json(fin.read()) # Load model
Loading model
print('Testing model')
val_prophet = pd.merge(y_val, X_val[df_weather.columns], on='datetime', how='left')
val_prophet.rename(columns={'datetime': 'ds', 'tsd': 'y'}, inplace=True)
# use the model to make a forecast
val_pred_prophet = prophet_model.predict(val_prophet)
# summarize the forecast
print(val_pred_prophet[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].head())
# plot forecast
prophet_model.plot(val_pred_prophet)
from matplotlib import pyplot as plt
# plt.scatter(test_prophet['ds'], test_prophet['y'])
plt.show()
Testing model
ds yhat yhat_lower yhat_upper
0 2022-01-01 00:00:00 18543.137484 15310.722410 21636.183191
1 2022-01-01 00:30:00 17976.745986 14726.250332 21180.560981
2 2022-01-01 01:00:00 17529.476476 14422.087231 20626.766335
3 2022-01-01 01:30:00 17120.734880 13840.901348 20215.794961
4 2022-01-01 02:00:00 16695.922696 13583.588698 19723.230036
print('Evaluate model')
mse = mean_squared_error(val_prophet['y'], val_pred_prophet['yhat'])
rmse = np.sqrt(mean_squared_error(val_prophet['y'], val_pred_prophet['yhat']))
mae = mean_absolute_error(val_prophet['y'], val_pred_prophet['yhat'])
mape = mean_absolute_percentage_error(val_prophet['y'], val_pred_prophet['yhat'])*100
print("MSE: %.2f" % (mse))
print("RMSE: %.2f" % (rmse))
print("MAE: %.2f" % (mae))
print("MAPE: %.2f%%" % (mape))
Evaluate model
MSE: 12090957.17
RMSE: 3477.21
MAE: 2636.98
MAPE: 9.63%
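The four scores are closely related: RMSE is the square root of MSE, and MAPE normalises the absolute error by the true value. A self-contained NumPy sketch of the definitions used throughout this notebook, with made-up numbers:

```python
import numpy as np

def regression_scores(y_true, y_pred) -> dict:
    """MSE, RMSE, MAE and MAPE (in percent) of a forecast."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    mse = np.mean(err ** 2)
    return {
        'MSE': mse,
        'RMSE': np.sqrt(mse),
        'MAE': np.mean(np.abs(err)),
        'MAPE': np.mean(np.abs(err / y_true)) * 100,
    }

scores = regression_scores([100, 200], [110, 180])
# errors are +10 and -20, so MAE = 15 and MAPE = (10% + 10%) / 2 = 10%
```

MAPE is the most interpretable of the four for demand data, since it is scale-free, but it is undefined when a true value is zero.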
model_name = 'xgb'
model_file = model_save_dir + 'model_' + model_name + '.model'
train_xgb = pd.merge(y_train, X_train, on='datetime', how='left')
scaler = StandardScaler()
scaler.fit(train_xgb[X_train.columns[1:]])
if not os.path.exists(model_file):
    print('Training model')
    xgb_model = xgb_model.fit(scaler.transform(train_xgb[X_train.columns[1:]]), train_xgb[y_train.columns[1:]])
    print('Saving model')
    xgb_model.save_model(model_file)
else:
    print('Loading model')
    xgb_model.load_model(model_file)
Loading model
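Note that the scaler is fitted on the training split only, so the validation and test sets are standardised with training statistics. A short sketch of what `fit`/`transform` compute, using a hypothetical single-column input:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [3.0], [5.0]])  # training column with mean 3.0
scaler = StandardScaler().fit(train)     # learns mean and std from train only
z = scaler.transform(np.array([[3.0], [5.0]]))
# The training mean maps to 0; other values to (x - mean) / std.
```

Fitting the scaler on the full dataset instead would leak validation and test statistics into training.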
print('Testing model')
# use the model to make a forecast
val_xgb = pd.merge(y_val, X_val, on='datetime', how='left')
val_pred_xgb = xgb_model.predict(scaler.transform(val_xgb[X_val.columns[1:]]))
plot_count = 1440
# plot prediction
plt.figure(figsize=(15,5))
plt.plot(val_xgb['datetime'].iloc[:plot_count], val_pred_xgb[:plot_count])
plt.scatter(val_xgb['datetime'].iloc[:plot_count], val_pred_xgb[:plot_count])
plt.show()
print('Evaluate model')
mse = mean_squared_error(y_val['tsd'], val_pred_xgb)
rmse = np.sqrt(mean_squared_error(y_val['tsd'], val_pred_xgb))
mae = mean_absolute_error(y_val['tsd'], val_pred_xgb)
mape = mean_absolute_percentage_error(y_val['tsd'], val_pred_xgb)*100
print("MSE: %.2f" % (mse))
print("RMSE: %.2f" % (rmse))
print("MAE: %.2f" % (mae))
print("MAPE: %.2f%%" % (mape))
Testing model
Evaluate model
MSE: 4609979.45
RMSE: 2147.09
MAE: 1718.80
MAPE: 6.01%
print('Feature importance')
feature_names = X_train.columns[1:]
importances = xgb_model.feature_importances_.reshape(-1,1)
model_importances = pd.DataFrame(importances, columns=['importance'], index=feature_names)
model_importances.sort_values(by='importance', ascending=False, inplace=True)
model_importances.to_csv(model_save_dir + 'features_' + model_name + '.csv')
fig, ax = plt.subplots()
model_importances['importance'].plot.bar(ax=ax)
ax.set_title("Feature importances using MDI")
ax.set_ylabel("Mean decrease in impurity")
fig.tight_layout()
Feature importance
Feature selection: the fitted XGBoost model can also be used to rank and select features for training other regression models. The code below accomplishes this task.
print('Feature selection')
# Fit model using each importance as a threshold
thresholds = np.sort(xgb_model.feature_importances_)
print('Retraining model')
for thresh in thresholds:
    # Select features using threshold
    selection = SelectFromModel(xgb_model, threshold=thresh, prefit=True)
    select_X_train = selection.transform(scaler.transform(train_xgb[X_train.columns[1:]]))
    # Train model
    selection_model_file = model_save_dir + 'model_' + model_name + '_th_{:.5f}'.format(thresh) + '.model'
    selection_model = XGBRegressor(n_estimators=500, subsample=0.7, max_depth=5)
    if os.path.exists(selection_model_file):
        selection_model.load_model(selection_model_file)
    else:
        selection_model.fit(select_X_train, train_xgb[y_train.columns[1:]])
        selection_model.save_model(selection_model_file)
    # Evaluate model
    select_X_val = selection.transform(scaler.transform(val_xgb[X_val.columns[1:]]))
    predictions = selection_model.predict(select_X_val)
    mae = mean_absolute_error(y_val['tsd'], predictions)
    mape = mean_absolute_percentage_error(y_val['tsd'], predictions)*100
    rmse = np.sqrt(mean_squared_error(y_val['tsd'], predictions))
    print("Thresh=%.5f, n=%d, MAE: %.2f, MAPE: %.2f%%, RMSE: %.2f" % (thresh, select_X_train.shape[1], mae, mape, rmse))
Feature selection
Retraining model
Thresh=0.00000, n=33, MAE: 1718.80, MAPE: 6.01%, RMSE: 2147.09
Thresh=0.00000, n=33, MAE: 1718.80, MAPE: 6.01%, RMSE: 2147.09
Thresh=0.00000, n=33, MAE: 1718.80, MAPE: 6.01%, RMSE: 2147.09
Thresh=0.00053, n=30, MAE: 1718.80, MAPE: 6.01%, RMSE: 2147.09
Thresh=0.00112, n=29, MAE: 1713.54, MAPE: 5.99%, RMSE: 2134.43
Thresh=0.00120, n=28, MAE: 1716.30, MAPE: 6.00%, RMSE: 2137.62
Thresh=0.00143, n=27, MAE: 1677.97, MAPE: 5.86%, RMSE: 2089.65
Thresh=0.00156, n=26, MAE: 1722.73, MAPE: 6.02%, RMSE: 2141.26
Thresh=0.00165, n=25, MAE: 1678.72, MAPE: 5.89%, RMSE: 2086.31
Thresh=0.00243, n=24, MAE: 1676.84, MAPE: 5.85%, RMSE: 2096.71
Thresh=0.00256, n=23, MAE: 1706.15, MAPE: 5.97%, RMSE: 2125.37
Thresh=0.00294, n=22, MAE: 1620.24, MAPE: 5.66%, RMSE: 2030.68
Thresh=0.00337, n=21, MAE: 1640.32, MAPE: 5.73%, RMSE: 2066.48
Thresh=0.00350, n=20, MAE: 1690.26, MAPE: 5.90%, RMSE: 2113.86
Thresh=0.00389, n=19, MAE: 1679.54, MAPE: 5.86%, RMSE: 2109.93
Thresh=0.00572, n=18, MAE: 1722.80, MAPE: 6.01%, RMSE: 2166.05
Thresh=0.00588, n=17, MAE: 1705.73, MAPE: 5.96%, RMSE: 2170.09
Thresh=0.00608, n=16, MAE: 1672.67, MAPE: 5.84%, RMSE: 2114.26
Thresh=0.00707, n=15, MAE: 1712.55, MAPE: 5.98%, RMSE: 2150.71
Thresh=0.00816, n=14, MAE: 1724.51, MAPE: 6.02%, RMSE: 2165.35
Thresh=0.00986, n=13, MAE: 1730.37, MAPE: 6.05%, RMSE: 2183.54
Thresh=0.01484, n=12, MAE: 1737.90, MAPE: 6.08%, RMSE: 2200.22
Thresh=0.02215, n=11, MAE: 1746.89, MAPE: 6.13%, RMSE: 2209.54
Thresh=0.03067, n=10, MAE: 1706.30, MAPE: 5.98%, RMSE: 2155.33
Thresh=0.03797, n=9, MAE: 1740.24, MAPE: 6.11%, RMSE: 2228.04
Thresh=0.04493, n=8, MAE: 1742.20, MAPE: 6.13%, RMSE: 2196.71
Thresh=0.04535, n=7, MAE: 1790.80, MAPE: 6.30%, RMSE: 2263.86
Thresh=0.08226, n=6, MAE: 1830.29, MAPE: 6.44%, RMSE: 2295.78
Thresh=0.08327, n=5, MAE: 4314.65, MAPE: 14.65%, RMSE: 5213.82
Thresh=0.09236, n=4, MAE: 4781.58, MAPE: 16.25%, RMSE: 5549.52
Thresh=0.14227, n=3, MAE: 4270.03, MAPE: 15.07%, RMSE: 5302.25
Thresh=0.15468, n=2, MAE: 4282.35, MAPE: 15.13%, RMSE: 5339.39
Thresh=0.18032, n=1, MAE: 4831.83, MAPE: 18.19%, RMSE: 6149.94
# Select features based on threshold
selected_threshold = 0.0029
selected_features = model_importances.loc[model_importances['importance'] >= selected_threshold]
selected_features.to_csv(model_save_dir + 'features_' + model_name + '_th_{:.5f}'.format(selected_threshold) + '.csv')
selected_features.index
Index(['date_dayofyear_cos', 'date_period', 'date_week_cos', 'date_weekday',
'tempmax', 'date_year', 'date_isholiday', 'temp', 'date_daypart',
'date_hour_sin', 'date_period_sin', 'date_dayofyear', 'date_week',
'date_month_cos', 'date_month_sin', 'date_quarter_cos', 'date_hour_cos',
'date_period_cos', 'humidity', 'date_quarter_sin', 'tempmin',
'date_week_sin'],
dtype='object')
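The same selection can be expressed directly on the importance table: keep every feature whose importance clears the chosen threshold. A sketch with hypothetical importances (the feature names and values below are illustrative, not the fitted ones):

```python
import pandas as pd

importances = pd.DataFrame(
    {'importance': [0.40, 0.30, 0.002, 0.001]},
    index=['temp', 'date_period', 'humidity', 'windspeed'],
)
threshold = 0.0029
selected = importances[importances['importance'] >= threshold]
# keeps the two features at or above the threshold, drops the rest
```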
print('Loading model')
xgb_model.load_model(model_save_dir + 'model_' + model_name + '_th_{:.5f}'.format(0.00294) + '.model')
print('Testing model')
# use the model to make a forecast
test_xgb = pd.merge(y_test, X_test, on='datetime', how='left')
scaler = scaler.fit(train_xgb[selected_features.index.to_list()]) # Fit scaler
test_pred_xgb = xgb_model.predict(scaler.transform(test_xgb[selected_features.index.to_list()]))
plot_count = 1440
print('Evaluate model')
mse = mean_squared_error(y_test['tsd'], test_pred_xgb)
rmse = np.sqrt(mean_squared_error(y_test['tsd'], test_pred_xgb))
mae = mean_absolute_error(y_test['tsd'], test_pred_xgb)
mape = mean_absolute_percentage_error(y_test['tsd'], test_pred_xgb)*100
print("MSE: %.2f" % (mse))
print("RMSE: %.2f" % (rmse))
print("MAE: %.2f" % (mae))
print("MAPE: %.2f%%" % (mape))
Loading model
Testing model
Evaluate model
MSE: 94463722.92
RMSE: 9719.24
MAE: 8104.76
MAPE: 30.04%
# Define the model
gmlp_model = Sequential()
gmlp_model.add(Dense(32, activation='relu', kernel_initializer='random_uniform', input_shape=(len(selected_features.index.to_list()),)))
gmlp_model.add(Dense(IndependentNormal.params_size(1), activation='softplus', kernel_initializer='random_uniform'))
gmlp_model.add(IndependentNormal(event_shape=1))
opt = RMSprop(learning_rate=0.1)
gmlp_model.compile(optimizer=opt,
loss=negloglik,
metrics=[MeanAbsolutePercentageError(), MeanAbsoluteError()])
model_name = 'gmlp'
model_file = model_save_dir + 'model_' + model_name + '.h5'
if not os.path.exists(model_file):
    print('Training model')
    train_gmlp = pd.merge(y_train, X_train, on='datetime', how='left')
    scaler = StandardScaler()
    es = EarlyStopping(monitor='val_loss', patience=50)
    gmlp_model.fit(scaler.fit_transform(train_gmlp[selected_features.index.to_list()]), train_gmlp[y_train.columns[1:]],
                   batch_size=100, epochs=30000, validation_split=0.1, callbacks=[es])
    print('Saving model')
    gmlp_model.save_weights(model_file)
else:
    print('Loading model')
    gmlp_model.load_weights(model_file)
Loading model
def plot_distribution(index: ndarray, samples: ndarray, mode: ndarray, lower_bound: ndarray, upper_bound: ndarray,
                      mean: ndarray, median: ndarray = None) -> None:
    """
    Plot observed samples together with the fitted predictive distribution.

    :param index: x-axis values (timestamps)
    :param samples: observed target values
    :param mode: mode of the predictive distribution
    :param lower_bound: lower bound of the ~95% confidence interval
    :param upper_bound: upper bound of the ~95% confidence interval
    :param mean: mean of the predictive distribution
    :param median: optional median of the predictive distribution
    """
    plt.figure(figsize=(15, 10))
    plt.scatter(index, samples, label='True value', alpha=.5, marker=".")
    plt.plot(index, mode, label='Mode value', color='r')
    plt.plot(index, mean, label='Mean value', color='m')
    if median is not None:
        plt.scatter(index, median, label='Median value', c='y', marker=".")
    plt.fill_between(np.squeeze(index), np.squeeze(lower_bound), np.squeeze(upper_bound), alpha=.3,
                     label='$CI$')
    plt.xlabel('Timestamp')
    plt.ylabel('Electricity demand (MW)')
    plt.title('Samples & fitted distribution\n+ 95% CIs')
    plt.legend()
    plt.show()
print('Testing model')
# use the model to make a forecast
val_gmlp = pd.merge(y_val, X_val, on='datetime', how='left')
val_pred_gmlp = gmlp_model(scaler.transform(val_gmlp[selected_features.index.to_list()]))
plot_count = 1440*12
# Get predictions - E[Y|X]
y_hat = val_pred_gmlp.mean()
# Get SD
y_sd = val_pred_gmlp.stddev()
# Compute conf ints
y_hat_lower = y_hat - 2 * y_sd
y_hat_upper = y_hat + 2 * y_sd
# Plot data, predicted line and ~95% conf ints
plot_distribution(val_gmlp['datetime'].iloc[:plot_count],
val_gmlp['tsd'].iloc[:plot_count],
y_hat[:plot_count],
y_hat_lower[:plot_count],
y_hat_upper[:plot_count],
y_hat[:plot_count])
print('Evaluate model')
mse = mean_squared_error(val_gmlp['tsd'], y_hat)
rmse = np.sqrt(mean_squared_error(val_gmlp['tsd'], y_hat))
mae = mean_absolute_error(val_gmlp['tsd'], y_hat)
mape = mean_absolute_percentage_error(val_gmlp['tsd'], y_hat)*100
print("MSE: %.2f" % (mse))
print("RMSE: %.2f" % (rmse))
print("MAE: %.2f" % (mae))
print("MAPE: %.2f%%" % (mape))
Testing model
Evaluate model
MSE: 11571126.39
RMSE: 3401.64
MAE: 2772.52
MAPE: 9.89%
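The mean ± 2 standard deviations band used above is the usual normal approximation of a ~95% interval (the exact multiplier is 1.96). A quick NumPy sketch of the coverage such a band achieves on synthetic Gaussian data, with a made-up demand-like scale:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sd = 20_000.0, 1_500.0                 # hypothetical mean and SD in MW
samples = rng.normal(mu, sd, size=10_000)

lower, upper = mu - 2 * sd, mu + 2 * sd
coverage = np.mean((samples >= lower) & (samples <= upper))
# For a well-calibrated Gaussian, coverage is close to 0.954
```

On real forecasts, comparing the empirical coverage of the predicted intervals against this nominal level is a simple calibration check.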
The error exhibits a higher magnitude during the summer season.
print('Testing model')
# use the model to make a forecast
test_gmlp = pd.merge(y_test, X_test, on='datetime', how='left')
test_pred_gmlp = gmlp_model(scaler.transform(test_gmlp[selected_features.index.to_list()]))
plot_count = 1440
# Get predictions - E[Y|X]
y_hat = test_pred_gmlp.mean()
# Get SD
y_sd = test_pred_gmlp.stddev()
# Compute conf ints
y_hat_lower = y_hat - 2 * y_sd
y_hat_upper = y_hat + 2 * y_sd
# Plot data, predicted line and ~95% conf ints
plot_distribution(test_gmlp['datetime'].iloc[:plot_count],
test_gmlp['tsd'].iloc[:plot_count],
y_hat[:plot_count],
y_hat_lower[:plot_count],
y_hat_upper[:plot_count],
y_hat[:plot_count])
print('Evaluate model')
mae = mean_absolute_error(test_gmlp['tsd'], y_hat)
mape = mean_absolute_percentage_error(test_gmlp['tsd'], y_hat)*100
mse = mean_squared_error(test_gmlp['tsd'], y_hat)
rmse = np.sqrt(mean_squared_error(test_gmlp['tsd'], y_hat))
print("MSE: %.2f" % (mse))
print("RMSE: %.2f" % (rmse))
print("MAE: %.2f" % (mae))
print("MAPE: %.2f%%" % (mape))
# Residual analysis
y_res = y_hat - np.array(test_gmlp['tsd']).reshape(-1,1)
fig, axes = plt.subplots(4, 1, figsize=(15, 15))
axes[0].scatter(test_gmlp['date_period'], y_res)
axes[1].scatter(test_gmlp['date_month'], y_res)
axes[2].scatter(test_gmlp['date_dayofyear'], y_res)
axes[3].scatter(test_gmlp['date_weekday'], y_res)
fig, axes = plt.subplots(2, 2, figsize=(15, 15))
axes[0,0].scatter(test_gmlp['datetime'], y_res)
axes[0,1].scatter(y_hat, y_res)
sns.kdeplot(y_res, ax=axes[1,0])
axes[1,1].scatter(y_hat, test_gmlp['tsd'])
plt.show()
Testing model
Evaluate model
MSE: 3955876.87
RMSE: 1988.94
MAE: 1586.76
MAPE: 5.34%
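The residual panels check the usual assumptions: no systematic pattern against period, month, or fitted values, and a roughly symmetric, zero-centred error density. Those properties can also be checked numerically; a sketch with synthetic residuals standing in for `y_res`:

```python
import numpy as np

rng = np.random.default_rng(1)
residuals = rng.normal(0.0, 1.0, size=5_000)  # stand-in for y_hat - y_true

mean = residuals.mean()
std = residuals.std()
skew = np.mean(((residuals - mean) / std) ** 3)
# Well-behaved residuals have mean ~0 (no bias) and skewness ~0 (symmetry).
```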
In this step, the robustness of the GMLP model is evaluated using a cross-validation procedure with an expanding training window.
X_cv = df_features.copy()
y_cv = df_demand.copy()
fold = 0
scores = []
metrics = ['MSE', 'RMSE', 'MAE', 'MAPE']
df_cv = pd.DataFrame(columns=metrics, index=range(5))
# Perform cross-validation
start_year = test_split.year - 5
for fold in range(5):
    # Split the data into training and validation sets
    fold_date = datetime(start_year + fold, 1, 1)
    print('--------------------')
    print('Fold ' + str(fold))
    print('Training from ' + X_cv['datetime'].min().strftime('%Y/%m/%d') + ' to ' + fold_date.strftime('%Y/%m/%d'))
    print('Testing from ' + fold_date.strftime('%Y/%m/%d') + ' to ' + test_split.strftime('%Y/%m/%d'))
    print('--------------------')
    X_train_fold = X_cv.loc[X_cv['datetime'] < fold_date]
    y_train_fold = y_cv.loc[y_cv['datetime'] < fold_date]
    X_val_fold = X_cv.loc[(X_cv['datetime'] >= fold_date) & (X_cv['datetime'] < test_split)]
    y_val_fold = y_cv.loc[(y_cv['datetime'] >= fold_date) & (y_cv['datetime'] < test_split)]
    # Set model filename
    model_name = 'gmlp'
    model_file = model_save_dir + 'model_' + model_name + '_cv_' + str(fold) + '.h5'
    # Define model
    gmlp_model = Sequential()
    gmlp_model.add(Dense(32, activation='relu', kernel_initializer='random_uniform', input_shape=(len(selected_features.index.to_list()),)))
    gmlp_model.add(Dense(IndependentNormal.params_size(1), activation='softplus', kernel_initializer='random_uniform'))
    gmlp_model.add(IndependentNormal(event_shape=1))
    # Compile and train the model
    opt = RMSprop(learning_rate=0.1)
    gmlp_model.compile(optimizer=opt,
                       loss=negloglik,
                       metrics=[MeanAbsoluteError()])
    if not os.path.exists(model_file):
        print('Training model')
        train_gmlp = pd.merge(y_train_fold, X_train_fold, on='datetime', how='left')
        scaler = StandardScaler()
        es = EarlyStopping(monitor='val_loss', patience=50)
        gmlp_model.fit(scaler.fit_transform(train_gmlp[selected_features.index.to_list()]), train_gmlp[y_train.columns[1:]],
                       batch_size=100, epochs=300, validation_split=0.2, callbacks=[es])
        print('Saving model')
        gmlp_model.save_weights(model_file)
    else:
        print('Loading model')
        gmlp_model.load_weights(model_file)
    # Evaluate the model on the validation set
    val_fold_gmlp = pd.merge(y_val_fold, X_val_fold, on='datetime', how='left')
    print('Testing model')
    # Use the model to make a forecast
    val_pred_gmlp = gmlp_model(scaler.transform(val_fold_gmlp[selected_features.index.to_list()]))
    # Get predictions - E[Y|X]
    y_hat = val_pred_gmlp.mean()
    # Get SD
    y_sd = val_pred_gmlp.stddev()
    # Compute conf ints
    y_hat_lower = y_hat - 2 * y_sd
    y_hat_upper = y_hat + 2 * y_sd
    print('Evaluate model')
    mae = mean_absolute_error(val_fold_gmlp['tsd'], y_hat)
    mape = mean_absolute_percentage_error(val_fold_gmlp['tsd'], y_hat)*100
    mse = mean_squared_error(val_fold_gmlp['tsd'], y_hat)
    rmse = np.sqrt(mse)
    df_cv.iloc[fold, :] = np.array([mse, rmse, mae, mape])
    print("MSE: %.2f" % (mse))
    print("RMSE: %.2f" % (rmse))
    print("MAE: %.2f" % (mae))
    print("MAPE: %.2f%%" % (mape))
--------------------
Fold 0
Training from 2009/01/01 to 2018/01/01
Testing from 2018/01/01 to 2023/01/01
--------------------
Loading model
Testing model
Evaluate model
MSE: 12909420.71
RMSE: 3592.97
MAE: 2929.37
MAPE: 10.07%
--------------------
Fold 1
Training from 2009/01/01 to 2019/01/01
Testing from 2019/01/01 to 2023/01/01
--------------------
Loading model
Testing model
Evaluate model
MSE: 14338705.79
RMSE: 3786.65
MAE: 3150.55
MAPE: 11.10%
--------------------
Fold 2
Training from 2009/01/01 to 2020/01/01
Testing from 2020/01/01 to 2023/01/01
--------------------
Loading model
Testing model
Evaluate model
MSE: 7534314.98
RMSE: 2744.87
MAE: 2137.82
MAPE: 7.63%
--------------------
Fold 3
Training from 2009/01/01 to 2021/01/01
Testing from 2021/01/01 to 2023/01/01
--------------------
Loading model
Testing model
Evaluate model
MSE: 5860005.86
RMSE: 2420.74
MAE: 1897.86
MAPE: 6.68%
--------------------
Fold 4
Training from 2009/01/01 to 2022/01/01
Testing from 2022/01/01 to 2023/01/01
--------------------
Loading model
Testing model
Evaluate model
MSE: 9651613.76
RMSE: 3106.70
MAE: 2495.94
MAPE: 8.90%
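The per-fold scores are usually summarised by their mean and standard deviation to judge robustness. A sketch aggregating the MAPE values printed in the cross-validation output above:

```python
import numpy as np

# MAPE per fold, copied from the cross-validation output above
fold_mape = np.array([10.07, 11.10, 7.63, 6.68, 8.90])
summary = 'MAPE: %.2f%% +/- %.2f%%' % (fold_mape.mean(), fold_mape.std(ddof=1))
```

The spread across folds is sizeable, which is expected given the expanding training window and the unusual demand patterns in the 2020 test period.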
In this section, plots are created to visualize the predictions of the GMLP model over the test period in 2023.
# use the model to make a forecast
test_gmlp = pd.merge(y_test, X_test, on='datetime', how='left')
test_pred_gmlp = gmlp_model(scaler.transform(test_gmlp[selected_features.index.to_list()]))
# Get predictions - E[Y|X]
y_hat = test_pred_gmlp.mean()
# Get SD
y_sd = test_pred_gmlp.stddev()
# Compute conf ints
y_hat_lower = y_hat - 2 * y_sd
y_hat_upper = y_hat + 2 * y_sd
# Plot data, predicted line and ~95% conf ints
plot_distribution(test_gmlp['datetime'],
test_gmlp['tsd'],
y_hat,
y_hat_lower,
y_hat_upper,
y_hat)
plot_count = 48*(7 + 1) # Plot first week and first day of year
# Plot data, predicted line and ~95% conf ints
plot_distribution(test_gmlp['datetime'].iloc[:plot_count],
test_gmlp['tsd'].iloc[:plot_count],
y_hat[:plot_count],
y_hat_lower[:plot_count],
y_hat_upper[:plot_count],
y_hat[:plot_count])
In conclusion, this notebook presents the results and findings of the UK electricity demand forecasting project. Through comprehensive analysis and experimentation, several key insights have been obtained:
After thorough evaluation, the GMLP model emerged as the best performer among the considered models. It exhibited the lowest values for key evaluation metrics, produced normally distributed residuals, and efficiently handled uncertainty. These characteristics solidify the GMLP model as the preferred choice for accurate electricity demand forecasting.